ARCTIC-3D: automatic retrieval and clustering of interfaces in complexes from 3D structural information

The formation of a stable complex between proteins lies at the core of a wide variety of biological processes and has been the focus of countless experiments. The huge amount of information contained in the protein structural interactome in the Protein Data Bank can now be used to characterise and classify the existing biological interfaces. We here introduce ARCTIC-3D, a fast and user-friendly data mining and clustering software to retrieve data and rationalise the interface information associated with the protein input data. We demonstrate its use through various examples, ranging from showing the increased interaction complexity of eukaryotic proteins, 20% of which on average have more than 3 different interfaces compared to only 10% for prokaryotes, to associating different functions with different interfaces. In the context of modelling biomolecular assemblies, we introduce the concept of “recognition entropy”, related to the number of possible interfaces of the components of a protein-protein complex, which we demonstrate to correlate with the modelling difficulty in classical docking approaches. The identified interface clusters can also be used to generate various combinations of interface-specific restraints for integrative modelling. The ARCTIC-3D software is freely available at github.com/haddocking/arctic3d and can be accessed as a web-service at wenmr.science.uu.nl/arctic3d.


1. Software Performance Analysis
2. Understanding and visualizing angles and distances between interfaces
3. On the clustering cutoff threshold
4. Retention of alternative interfaces: an example
5. Applying arctic3d-resclust to CPORT predictions
Supplementary References

1 Software Performance Analysis

Supplementary Figure 1: Box plot showing the execution time (on a logarithmic scale) of each step of the ARCTIC-3D protocol for the 23446 UNIPROT IDs for which interface information is available (see main text, Sec. UNIPROT-wide analysis). The PDB retrieval stage is usually the computational bottleneck of the procedure, accounting for more than 60% of the execution time on average.
In this section we break down the performance of ARCTIC-3D based on the 23446 proteins discussed in the main text (Sec. UNIPROT-wide analysis). These are the UNIPROT IDs for which interface information is available from the PDBe Graph API.
Average and median execution times for the five stages of the protocol, together with the overall execution time, are presented in Table 1, while Fig. 1 shows the corresponding box plot. The Interface Retrieval step, in which interfaces are retrieved and parsed, is never a limiting factor in terms of computational time. The PDB retrieval stage, instead, is typically the bottleneck of the protocol, as the code has to retrieve and download several PDB files to pick the one retaining the highest number of interfaces. The timings of these steps are only indicative, as they depend heavily on the network connection speed.
The execution speed of the Interface Matrix and Clustering steps is linked to the number of retrieved interfaces N, as these steps scale quadratically ($O(N^2)$ for Interface Matrix) or worse ($O(N^2 \log N)$ for Clustering) with this number. Fig. 2 shows a scatter plot of the Interface Matrix execution time against the overall number of residues and the number of interfaces. There is a dependency on the number of amino acids, due to the larger number of Gaussian couplings in the dissimilarity calculation, but the main limiting factor is clearly the number of interfaces.
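The quadratic scaling of the Interface Matrix step follows from the number of unordered interface pairs, N(N - 1)/2. The sketch below illustrates this with a generic pairwise loop. It is not the ARCTIC-3D implementation: the function names, the toy interfaces and the Jaccard-style placeholder dissimilarity are assumptions made for illustration only, and the Gaussian couplings mentioned above are not modelled.

```python
from itertools import combinations


def n_pairs(n_interfaces):
    """Number of unordered interface pairs, N * (N - 1) / 2."""
    return n_interfaces * (n_interfaces - 1) // 2


def jaccard_dissimilarity(a, b):
    """Placeholder dissimilarity, 1 - |A & B| / |A | B| (NOT the ARCTIC-3D measure)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def pairwise_matrix(interfaces, dissimilarity=jaccard_dissimilarity):
    """Fill a symmetric N x N matrix by visiting every unordered pair once."""
    n = len(interfaces)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = dissimilarity(interfaces[i], interfaces[j])
    return matrix


# Toy interfaces given as sets of residue numbers (invented for illustration).
toy_interfaces = [{1, 2, 3}, {2, 3, 4}, {10, 11}, {10, 11, 12}]
toy_matrix = pairwise_matrix(toy_interfaces)

# A tenfold increase in N gives roughly a hundredfold increase in comparisons.
for n in (10, 100, 1000):
    print(n, n_pairs(n))
```

Each additional interface must be compared with all the existing ones, which is why the number of interfaces dominates the cost of this step.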
The Output stage typically takes a few hundred milliseconds to complete, reaching the timescale of seconds only when multiple large PDB files have to be written to disk.
The pdb-to-use and chain-to-use parameters allow the user to select the PDB file and chain to be retrieved in the PDB retrieval stage. If these values are defined, the computational cost of this stage decreases dramatically, approximately halving the average ARCTIC-3D execution time.

2 Understanding and visualizing angles and distances between interfaces

An important ingredient of the ARCTIC-3D protocol is the definition of the quantity used to discriminate between different interfaces. In the main text (Methods section) we proposed to use the sine of the angle between two interfaces rather than the distance between them. This is because the distance is not robust with respect to differences in the number of amino acids between the various interfaces. Consider interfaces INT3 and INT4 of Fig. 3: although they differ by a few residues (INT3 has a few more amino acids in the lower domain), they share a very substantial fraction of interface residues. Biologically, they represent the same interface.
If we look at the numbers shown in Fig. 3, we can see that the sine of the angle properly captures the previous observations, while the distance does not. In fact, the distance between INT1 and INT2 is approximately equal to the one between INT3 and INT4, contradicting the observation that the latter are much closer to each other than the former. Note that the distance we are referring to is not the simple Cartesian distance between the centers of the interfaces, but is calculated according to Eq. 5 in the Methods section of the main manuscript. The use of the angle solves this issue, as it only measures the overlap between interfaces, irrespective of the number of amino acids forming them.

The default value of the cutoff in ARCTIC-3D clustering is 0.866, namely the sine of an angle θ equal to 60 degrees. This threshold may not suit all needs, as a stricter or looser clustering may be more appropriate in some cases. Users can inspect the ARCTIC-3D output dendrogram (see Fig. 4 for an example) and adjust the threshold parameter to fine-tune the clustering.
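The difference between a distance and the sine of an angle can be illustrated with a deliberately simplified model in which each interface is a binary vector over residues. This is an assumption made for illustration: the coupling between residues used by ARCTIC-3D is ignored, and the distance below is a plain Euclidean distance, not Eq. 5 of the main manuscript. The residue numbers and function names are invented.

```python
import math

DEFAULT_CUTOFF = 0.866  # default ARCTIC-3D clustering cutoff, i.e. sin(60 degrees)


def sine_dissimilarity(a, b):
    """Sine of the angle between the binary membership vectors of two
    non-empty residue sets (simplified model, no residue-residue coupling)."""
    cosine = len(a & b) / math.sqrt(len(a) * len(b))
    return math.sqrt(max(0.0, 1.0 - cosine ** 2))


def euclidean_distance(a, b):
    """Euclidean distance between the same binary vectors: sqrt(|A xor B|)."""
    return math.sqrt(len(a ^ b))


# Invented residue numbers, chosen so that both pairs differ by exactly 10 residues.
small_a, small_b = set(range(1, 6)), set(range(50, 55))        # disjoint, 5 residues each
large_a, large_b = set(range(100, 140)), set(range(100, 150))  # 40 shared residues

pairs = {"small pair": (small_a, small_b), "large pair": (large_a, large_b)}
for name, (a, b) in pairs.items():
    s = sine_dissimilarity(a, b)
    print(f"{name}: distance={euclidean_distance(a, b):.2f} "
          f"sine={s:.2f} separated={s > DEFAULT_CUTOFF}")
```

In this toy setting the two pairs have the same distance, while the sine is 1.0 for the disjoint pair and well below the default cutoff for the largely overlapping pair, which mirrors qualitatively the behaviour described for Fig. 3.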
Here we illustrate the impact of the threshold on the clustering generated for the proteins of the BM5 dataset (discussed in the main text, Sec. Results). We re-ran the analysis with eight evenly spaced values of the angle threshold, going from 10 to 80 degrees.
Fig. 5 shows the relationship between the angle clustering threshold and the number of generated clusters. When the threshold angle is lower than 50 degrees the clustering procedure is very strict, giving rise to a high number of binding surfaces. In the limiting case of θ = 10 degrees, many proteins are associated with more than 25 clusters, making the cluster analysis cumbersome. High clustering thresholds (θ = 80 degrees), instead, tend to lump all the interfaces into very few binding surfaces. We find θ = 60 degrees to be a good compromise between these two scenarios.
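The thresholds in this scan are quoted as angles, whereas the clustering cutoff is expressed as a sine. A minimal sketch of the conversion is given below. The variable and function names are illustrative and not part of the ARCTIC-3D code base.

```python
import math


def sine_cutoffs(angles_deg):
    """Map each angle threshold (in degrees) to the corresponding sine cutoff."""
    return {angle: math.sin(math.radians(angle)) for angle in angles_deg}


# Eight evenly spaced angle thresholds, from 10 to 80 degrees.
angles = list(range(10, 90, 10))
cutoffs = sine_cutoffs(angles)

for angle in angles:
    print(f"theta = {angle:2d} degrees -> cutoff = {cutoffs[angle]:.3f}")

# Change in the sine cutoff produced by each 10-degree step.
steps = [cutoffs[hi] - cutoffs[lo] for lo, hi in zip(angles, angles[1:])]
```

The entry for 60 degrees reproduces the default cutoff of 0.866. Equal steps in the angle correspond to progressively smaller steps in the sine, so a 10-degree change shifts the cutoff more at low angles than at high ones.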
From Fig. 5 it is also possible to see that the same difference (e.g. 10 degrees) in the angle clustering threshold is typically more impactful at lower values and less significant at higher values (for example, going from 70 to 80 degrees). Using the sine of this angle removes this dependency.

Fig. 6 shows the results of applying ARCTIC-3D to nine interfaces retrieved for the Small ribosomal subunit protein uS5 (UNIPROT ID A0QSG6).
On the left side of the figure, the dendrogram illustrates how these interfaces are compared to each other and merged into clusters. On the right, the three interfaces composing each cluster are projected onto the reference structure (PDB 5xyu, chain E). The corresponding ARCTIC-3D clusters are shown using the color coding described in the main text.
5 Applying arctic3d-resclust to CPORT predictions

As an example scenario for the arctic3d-resclust CLI, we retrieve a set of possibly interacting residues for PDB file 3HMR (2) using CPORT (3) with default settings. CPORT is a software tool dedicated to the prediction of protein-protein interface amino acids by combining up to six different predictors. Upon providing arctic3d-resclust with the structure and the list of possibly interacting amino acids, we obtain two clusters.

Supplementary Figure 3: Visualization of four different interfaces on the structure of the DNA-directed RNA polymerase II RPB7 subunit (UNIPROT ID P62487, light blue in the picture). Interface residues are highlighted in orange in each panel, while the corresponding interacting partner is represented in gray. INT1 and INT2 are small but biologically different interfaces located on two different sides of the protein. INT3 and INT4 are bigger interfaces that mediate the interaction of RPB7 with the DNA-directed RNA polymerase II RPB4 subunit (UNIPROT ID O15514). Although INT3 and INT4 do not share the totality of their amino acids, their dissimilarity should be low, while it should be higher for INT1 and INT2. The distance does not reflect this, giving two very similar values. The sine of the angle (sin(θ)) is more effective, as it assigns the highest possible dissimilarity (sin(θ) = 1.0) to the first pair, while identifying INT3 and INT4 as closely related (sin(θ) = 0.32). Image produced with Molstar (1).

Fig. 3 visualizes this issue for four interfaces formed by UNIPROT ID P62487 (DNA-directed RNA polymerase II subunit RPB7). INT1 refers to the interface formed with UNIPROT ID P51948 (CDK-activating kinase assembly factor MAT1) in PDB file 6o9l (chain 3). INT2 is related to UNIPROT ID P61218 (DNA-directed RNA polymerases I, II, and III subunit RPABC2) and is retrieved from PDB 5iyc, chain F. Clearly, these two interfaces are located in two different regions of the protein and should therefore be structurally separated. INT3 and INT4 are two interfaces formed by UNIPROT ID P62487 with UNIPROT ID O15514 (DNA-directed RNA polymerase II subunit RPB4) in PDB files 6xre (chain D) and 6drd (chain D).

3 On the clustering cutoff threshold

Supplementary Figure 5: Box plot showing the dependence of the number of clusters on the value of θ used as the clustering threshold. Orange lines represent median values, while green triangles represent average values.

4 Retention of alternative interfaces: an example

Supplementary Figure 6: Example application of ARCTIC-3D to nine interfaces retrieved for UNIPROT ID A0QSG6. The interfaces are compared to each other in the dendrogram (left) and divided into three binding surfaces that are localised in different regions of the protein (right). Image produced with Molstar (1).

Table 1: Mean and median execution times for the five stages of the ARCTIC-3D protocol, together with the average and median overall execution time, calculated over the 23446 ARCTIC-3D runs on the UNIPROT Swiss-Prot database (see main text).